SYS 6018 | Spring 2021 | University of Virginia


1 SYS 6018: Disaster Relief Project

H. Diana McSpadden

2 Introduction

When natural disasters or other emergencies result in compromised communications and transportation challenges, it can be impossible within a reasonable time to locate survivors using human-only methods. This disaster relief toy problem seeks to discover if an algorithm can effectively identify images corresponding to people who require relief.

To investigate whether this is possible a data set was provided containing RGB color values corresponding to pixels. The pixels are from high-resolution images taken by an aircraft above Haiti after the 2010 earthquake. Blue tarps had been distributed to survivors, but rescue workers did not have information about where survivors relocated after receiving the blue tarps.

Blue tarps have a distinct color when compared to other identifiable elements in Haiti. The training data set includes 63,241 RGB values for pixels in the toy problem’s images. The training data set includes a label for each pixel. The labels include:
  • Blue Tarp
  • Rooftop
  • Soil
  • Various Non-Tarp
  • Vegetation

Below the training data is explored, and various modeling methods are applied to determine if, and which method can be effectively used to identify blue tarps by RGB values.

3 Training Data / EDA

# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library("GGally")
library(gridExtra)
library(plotly)
library(caret)
library(boot)
library(regclass)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)
library(e1071)
library(caret)
library(boot)
filename = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv' #this url is beng 
haiti <- read_csv(filename)
print(dim(haiti))
#> [1] 63241     4

Below are the first 6 rows of the training dataset.

head(haiti)
#> # A tibble: 6 x 4
#>   Class        Red Green  Blue
#>   <chr>      <dbl> <dbl> <dbl>
#> 1 Vegetation    64    67    50
#> 2 Vegetation    64    67    50
#> 3 Vegetation    64    66    49
#> 4 Vegetation    75    82    53
#> 5 Vegetation    74    82    54
#> 6 Vegetation    72    76    52

The dataframe contains 4 columns, and 63,241 rows. The Class column contains the correct label for the observation. Red, Green and Blue parameters are each values between 0 and 255 used in the additive RBG color model.

3.1 Class Factor

To prepare the data for exploratory data analysis I make Class a factor.

haiti %>% 
  mutate(Class = factor(Class)) 
#> # A tibble: 63,241 x 4
#>    Class        Red Green  Blue
#>    <fct>      <dbl> <dbl> <dbl>
#>  1 Vegetation    64    67    50
#>  2 Vegetation    64    67    50
#>  3 Vegetation    64    66    49
#>  4 Vegetation    75    82    53
#>  5 Vegetation    74    82    54
#>  6 Vegetation    72    76    52
#>  7 Vegetation    71    72    51
#>  8 Vegetation    69    70    49
#>  9 Vegetation    68    70    49
#> 10 Vegetation    67    70    50
#> # ... with 63,231 more rows

Examine the numbers and percentages in each of the 5 classes:

haiti %>%
  group_by(Class) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#>   Class                N  Perc
#> * <chr>            <int> <dbl>
#> 1 Blue Tarp         2022     3
#> 2 Rooftop           9903    16
#> 3 Soil             20566    33
#> 4 Various Non-Tarp  4744     8
#> 5 Vegetation       26006    41

3.1.0.1 Observations:

The records are not evenly distributed between the categories. Of the Classes, Blue Tarp, our “positive” category if we are thinking a binary positive/negative identification, is only 3% of our sample. Soil and Vegetation make up the majority of our sample at 74%.

3.2 Binary Class Factor vs. 5 Class Factor

It will be interesting to see performance predicting each of these categories, or a binary is or is not Blue Tarp.

3.2.1 Create Binary DataFrame

Create a DataFrame that is only Blue Tarp, or not Blue Tarp:
  • 0 == Not a Blue Tarp
  • 1 == Is a Blue Tarp

After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * .001
2. RBSqr = (Red + Blue)^2 * .001

I created these to continue using the Red and Green values, but I wanted to increase the difference in median value difference between the positive and negative classes. There is significant interplay in color values between Red, Green, and Blue in identifying the correct shade or blue, and I want to continue using Red and Green values but increase the linear separability between the classes. The 0.01 multiplier is to return the number scale to a range similar to standard RGB values, i.e, a manual “scaling” of the new parameters.

haitiBinary =  haiti %>%
  mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))

haitiBinarySqrs = haiti %>%
  mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))

Re-examine the numbers and percentages in each of the 2 classes:

haitiBinary %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#>   ClassBinary     N  Perc
#> * <fct>       <int> <dbl>
#> 1 0           61219    97
#> 2 1            2022     3

3.2.2 How are red, blue and green values distributed between the 5 classes?

redplot <- ggplot(haiti, aes(x=Class, y=Red)) + 
  geom_boxplot(col='red')

greenplot <- ggplot(haiti, aes(x=Class, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplot, greenplot, blueplot)

3.2.3 How are red, blue and green values distributed between the 2 classes?

redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) + 
  geom_boxplot(col='red')

greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) + 
  geom_boxplot(col='darkgreen')

blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplotB, greenplotB, blueplotB)

### How are red, blue and green values distributed between the 2 classes with the square values for Red + Blue and Green Blue?

redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) + 
  geom_boxplot(col='red')

greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) + 
  geom_boxplot(col='darkgreen')

blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) + 
  geom_boxplot(col='darkblue')

grid.arrange(redplotB, greenplotB, blueplotB)

3.2.3.1 Box Plot Observations

For the 5-class box plots:

“Blue Tarp” as the “positive” result, and other results as the “negative” result.

Regarding the box plot of the five categories, of interest is that “Soil” and “Vegetation” are relatively unique in their RGB distributions. “Rooftop” and “Various Non-Tarp” are more similar in their RBG distributions

For the 2-class box plots:

If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.

Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.

Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.

There is almost no overlap in the blue data with non-blue tarps, and blue tarps.

For the 2-class box plots with the additive square values:

If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.

The values of RBSqr have a larger range for negative results than for negative results, and median is significantly greater in the positive results.

GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.

There is almost no overlap in the blue data with non-blue tarps, and blue tarps.

3.2.4 View the correlation between Red, Green and Blue

These correlations make sense as the pixels were of highly saturated/“additive” colors. There are few pixels in the data set with low values for R,G,B.

#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(haitiBinarySqrs[-1], progress = F)
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function

#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function

#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function

The RBSqr and GBSqr have significantly less variance in their values, and better differentiation between the 2 classes than the Red and Green variables. I will be using these transformed variables in my models.

3.2.5 3-D Scatterplot

To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.

3.2.5.1 Five-Class 3-D Scatterplot

fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))

fiveCat3D = fiveCat3D %>%
  layout(title="5 Category RBG Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))

fiveCat3D

5-Class 3-D Scatter Plot Observations
One can see that there are discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps are higher blue values, but they do have a range of red and green values.

The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.

3.2.5.1.1 Two-Class 3-D Scatterplot
binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr, type="scatter3d", mode="markers", color=haitiBinary$ClassBinary, colors = c('red','blue2'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))

binary3D = binary3D %>%
  layout(title="Binary RBG Plot", scene = list(xaxis = list(title = "RBSqr", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "GBSqr", color="green")))

binary3D

2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. Visually, there is an near-distinct linear boundary between the blue tarp and non-blue tarp observations.

3.2.6 Parameter Selection:

Based on EDA, I am hopeful that my models will perform well using the following predictors:
  1. Red
  2. Green
  3. Blue
  4. GBSqr: ((Green + Blue)^2) * .001
  5. RBSqr: ((Red + Blue)^2) * .001

4 Model Training

4.1 Set-up

I so not consider normalization or additional scaling because the ranges of Red, Green, Blue, RBSqr, and GBSqr are the same.

I am using the 2-Class data set for the following reasons:
  1. The distinctions in the 2-Class data set, as seen in the 3-D scatterplot, are clear.
  2. The stated problem is to classify ‘Blue Tarp’ from the other classes. Classifying the other classes is not of interest.
  3. I am using 10-fold cross-validation to evaluate various models.

4.2 Cross-Validation Performance

For logistic regression, LDA, QDA, KNN and Penalized Logistic Regression Cross-Validation threshold performance use ROC and Accuracy for tuning.

The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data:
  • AUROC
  • True Positive Rate
  • False Positive Rate
  • Precision



For the Models: * No: Not a Blue Tarp is Negative * Yes: Is a Blue Tarp is Positive

4.3 Logistic Regression

In order to use R’s built-in factor function I set ClassBinary to a factor and order it “No”, “Yes”.
I also review the factor counts and create a dataframe named “train”.

levels(haitiBinarySqrs$ClassBinary)
#> [1] "0" "1"
levels(haitiBinarySqrs$ClassBinary)=c("No","Yes")

levels(haitiBinarySqrs$ClassBinary)
#> [1] "No"  "Yes"
fct_count(haitiBinarySqrs$ClassBinary)
#> # A tibble: 2 x 2
#>   f         n
#>   <fct> <int>
#> 1 No    61219
#> 2 Yes    2022

4.3.1 Comparison of Three Logistic Regression Models

Using 10-fold cross validation and p values in the collection (.1,.2,.3,.35,.4,.5,.6,.7,.8,.9) I tested three models:
** Model 1: Greatest Complexity:** \[ClassBinary = Blue + GBSqr + RBSqr \hspace{.3cm} | \hspace{.3cm} GBSqr = (Green + Blue)^2 \hspace{.3cm} and \hspace{.3cm} RBSqr = (Red + Blue)^2\]

** Model 2: Standard Additive Model:** \[ClassBinary = Blue + Green + Red\]

** Model 3: Simple Logistic Regression Model:** \[ClassBinary = Blue\]

library(yardstick)

set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)

# create storage variables for the p value, ROC, Specificity, and Sensitivity
k_start = 1
k_end = 17
out_nvar = k_end - k_start + 1

p_i = rep(NA, out_nvar)
# complex model measures
sensitivity_c_i = rep(NA, out_nvar)
specificity_c_i = rep(NA, out_nvar)
prec_c_i = rep(NA, out_nvar)

# additive model measures
sensitivity_a_i = rep(NA, out_nvar)
specificity_a_i = rep(NA, out_nvar)
prec_a_i = rep(NA, out_nvar)

# simple model measures
sensitivity_s_i = rep(NA, out_nvar)
specificity_s_i = rep(NA, out_nvar)
prec_s_i = rep(NA, out_nvar)

counter = 1

for (j in c(.1,.2,.3,.35,.4, .5,.6,.7,.8,.9))
{ 

  p_i[counter] = j
  
  accumulator_c_sens = 0
  accumulator_c_spec = 0
  accumulator_c_prec = 0
  accumulator_a_sens = 0
  accumulator_a_spec = 0
  accumulator_a_prec = 0
  accumulator_s_sens = 0
  accumulator_s_spec = 0
  accumulator_s_prec = 0
  
  #Perform 10 fold cross validation
  for(i in 1:10) {
    
      #Segement your data by fold using the which() function 
      testIndexes = which(folds==i,arr.ind=TRUE)
      testData = haitiBinarySqrsShuffled[testIndexes, ]
      trainData = haitiBinarySqrsShuffled[-testIndexes, ]
      
      # define complex model
      glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)
      # define additive model
      glm.fits.additive = glm(ClassBinary ~ Blue+Green+Red, binomial, data = trainData)
      # define simple model
      glm.fits.simple = glm(ClassBinary ~ Blue, binomial, data = trainData)
      
      # fit the complex model
      glm.pred.complex = glm.fits.complex %>%  augment(newdata=testData) %>% 
        dplyr::select(ClassBinary, .fitted)  %>% 
        mutate(ClassBinary=factor(ClassBinary))%>%
        mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
        mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes"))) 
      
      if (nlevels(glm.pred.complex$.prediction) > 1)
      {
        accumulator_c_sens = accumulator_c_sens + yardstick::sens(glm.pred.complex, ClassBinary, .prediction)[[3]]
        accumulator_c_spec = accumulator_c_spec + yardstick::spec(glm.pred.complex, ClassBinary, .prediction)[[3]]
        accumulator_c_prec = accumulator_c_prec + yardstick::precision(glm.pred.complex, ClassBinary, .prediction)[[3]]
      }
      
      # fit the additive model
      glm.pred.additive = glm.fits.additive %>%  augment(newdata=testData) %>% 
        dplyr::select(ClassBinary, .fitted)  %>% 
        mutate(ClassBinary=factor(ClassBinary))%>%
        mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
        mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes"))) 
      
      if (nlevels(glm.pred.additive$.prediction) > 1)
      {
        accumulator_a_sens = accumulator_a_sens + yardstick::sens(glm.pred.additive, ClassBinary, .prediction)[[3]]
        accumulator_a_spec = accumulator_a_spec + yardstick::spec(glm.pred.additive, ClassBinary, .prediction)[[3]]
        accumulator_a_prec = accumulator_a_prec + yardstick::precision(glm.pred.additive, ClassBinary, .prediction)[[3]]
      }
      
      # fit the simple model
      glm.pred.simple= glm.fits.simple %>%  augment(newdata=testData) %>% 
        dplyr::select(ClassBinary, .fitted)  %>% 
        mutate(ClassBinary=factor(ClassBinary))%>%
        mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < j, "No", "Yes")) %>%
        mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes"))) 
      
      if (nlevels(glm.pred.simple$.prediction) > 1)
      {
        accumulator_s_sens = accumulator_s_sens + yardstick::sens(glm.pred.simple, ClassBinary, .prediction)[[3]]
        accumulator_s_spec = accumulator_s_spec + yardstick::spec(glm.pred.simple, ClassBinary, .prediction)[[3]]
        accumulator_s_prec = accumulator_s_prec + yardstick::precision(glm.pred.simple, ClassBinary, .prediction)[[3]]
      }
      
  }

  sensitivity_c_i[counter] = accumulator_c_sens / 10
  specificity_c_i[counter] = accumulator_c_spec / 10
  prec_c_i[counter] = accumulator_c_prec /10
  
  sensitivity_a_i[counter] = accumulator_a_sens / 10
  specificity_a_i[counter] = accumulator_a_spec / 10
  prec_a_i[counter] = accumulator_a_prec /10
  
  sensitivity_s_i[counter] = accumulator_s_sens / 10
  specificity_s_i[counter] = accumulator_s_spec / 10
  prec_s_i[counter] = accumulator_s_prec /10
  
  counter = counter + 1
}

outcome = data.frame(p_i, sensitivity_c_i, specificity_c_i, prec_c_i, sensitivity_a_i, specificity_a_i, prec_a_i, sensitivity_s_i, specificity_s_i, prec_s_i)

print(outcome, n = nrow(outcome))
#> Error in print.default(m, ..., quote = quote, right = right, max = max): invalid 'na.print' specification
SLR Model 3: ClassBinary = Blue
Only p values of .1, .2, .3 and .35 are of interest in the SLR model. This model was unable to meaningfully differentiate between blue tarps and not blue tarps. The sensitivity, specificity and precision did not approach the performance of the other two models. Model 3 will no longer be considered.
Model 1 and Model 2
The differences between Model 1, the more complex mode, and Model 2, the additive model without the square terms, are more nuanced.
  • The highest precision for both models was with p == 0.1:
  • Model 3: 0.9993
  • Model 2 == 0.9988
  • The highest specificity for both models was with p == 0.1:
  • Model 3: 0.9793
  • Model 2: 0.9630
  • The highest sensitivity for both models was with p == 0.95:
  • Model 3: 0.9996
  • Model 2: 0.9997


Model 3, with the square terms, performed better on all measures.
Because geographic distribution of blue tarps is not included in this data, I will focus on precision and specificity, i.e. the model that identifies the most blue tarps. p = 0.1 identified the most blue tarps, and had the highest precision when using 10-fold cross validation.

4.3.2 Logistic Regression Performance

Below I use the entire training data set on the model with p = 0.1 to get the training Accuracy, TPR, FPR, and Precision:

set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)

out_lr.complex = tibble()

#Perform 10 fold cross validation
for(i in 1:10) {
  
    #Segement your data by fold using the which() function 
    testIndexes = which(folds==i,arr.ind=TRUE)
    testData = haitiBinarySqrsShuffled[testIndexes, ]
    trainData = haitiBinarySqrsShuffled[-testIndexes, ]
    
    
    # define complex model
    glm.fits.complex = glm(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = trainData)

    # fit the complex model
    glm.pred.complex = glm.fits.complex %>%  augment(newdata=testData) %>% 
      dplyr::select(ClassBinary, .fitted)  %>% 
      mutate(ClassBinary=factor(ClassBinary))%>%
      mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "No", "Yes")) %>%
      mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes"))) 
    
    out_lr.complex = bind_rows(out_lr.complex, 
            tibble(truth = testData$ClassBinary, 
                   prediction = glm.pred.complex$.prediction,
                   fitted = glm.pred.complex$.fitted))
    
}

caret::confusionMatrix(out_lr.complex$prediction, out_lr.complex$truth)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    No   Yes
#>        No  60987    41
#>        Yes   232  1981
#>                                           
#>                Accuracy : 0.9957          
#>                  95% CI : (0.9951, 0.9962)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9333          
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.9962          
#>             Specificity : 0.9797          
#>          Pos Pred Value : 0.9993          
#>          Neg Pred Value : 0.8952          
#>              Prevalence : 0.9680          
#>          Detection Rate : 0.9644          
#>    Detection Prevalence : 0.9650          
#>       Balanced Accuracy : 0.9880          
#>                                           
#>        'Positive' Class : No              
#> 

For Logistic Regression, my calculations for Accuracy, TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.

The following cross-validation measures were calculated with a threshold, p, of 0.1:
  • Accuracy: 0.9957
  • TPR: 1981 / 2022 = 0.98
  • FPR: 232 / 61219 = 0.0038
  • Precision: 1981 / 2212 = 0.896

4.3.3 Logistic Regression Cross-Validation ROC Curve:

library(ROCR)

# create a function so I can use repeatedly
function_ROC_AUC = function(fitted, truth, title)
{
  ##produce the numbers associated with classification table
  rates = prediction(fitted, truth)
  
  ##store the true positive and false postive rates
  roc_result = performance(rates, measure = "tpr", x.measure = "fpr")
  
  par(mfrow=c(1,2))

  ##plot ROC curve and overlay the diagonal line for random guessing
  plot(roc_result, main = title)
  lines(x = c(0,1), y = c(0,1), col="red")
  
  plot(roc_result, avg= "vertical", spread.estimate="boxplot", lwd=2,col='blue',
      show.spread.at= seq(0.1, 0.9, by=0.1),
      main= "Accuracy across cutoffs", xlab="Threshold cutoff")
  
  ##compute the AUC
  auc = performance(rates, measure = "auc")
  print(auc@y.values[[1]])
}

function_ROC_AUC(out_lr.complex$fitted, out_lr.complex$truth, "ROC For Tuned Complex Logisitic Regressions Model")

#> [1] 0.9995643

The Logistic Regression ROC-AUC for the 10-fold cross-validated training data with p=0.1 is: 0.9996.

4.4 LDA

I trained the LDA model using 10-fold cross validation.

10-fold cross validation was used for p-values from 0.1 to .95. Tuning was performed using ROC.

I tested with thresholds in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95). The best performing threshold on accuracy was 0.3. For illustrative purposes I run with 0.3, and 0.4.

#y_true

set.seed(1976)

haitiBinarySqrsShuffled = haitiBinarySqrs[sample(nrow(haitiBinarySqrs)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsShuffled)),breaks=10,labels=FALSE)

trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)

out_lda_p = tibble()

#Perform 10 fold cross validation
for(i in 1:10) {
  
    #Segement your data by fold using the which() function 
    testIndexes = which(folds==i,arr.ind=TRUE)
    testData = haitiBinarySqrsShuffled[testIndexes, ]
    trainData = haitiBinarySqrsShuffled[-testIndexes, ]
    
    train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
    
    # test with the fold's test data
    preds = predict(train_lda, newdata = testData, type="prob")
    
    #- evaluate fold k
    y_true = testData$ClassBinary
    # check the threshold
    # TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
    #for (p in c(.2, .8)){
    for (p in c(.3, .4)){
      #y_hat = tibble(hat_pred = preds$Yes > p) %>% mutate(hat_value = ifelse(preds$Yes > p))
      y_hat = preds %>% mutate(hat_value = ifelse(Yes > p, "Yes", "No")) %>% select(hat_value)
      acc = apply(y_hat, 2, function(y) mean(y == y_true))
      
        #- output
      out_lda_p = bind_rows(out_lda_p, tibble(threshold = p, accuracy = acc))
    }
}

#out_lda_p
#-- Get mean accuracy
perf_lda_p = out_lda_p %>% 
  group_by(threshold) %>% 
  summarize(mean_acc = mean(accuracy)) 

perf_lda_p %>% arrange(-mean_acc)
#> # A tibble: 2 x 2
#>   threshold mean_acc
#>       <dbl>    <dbl>
#> 1       0.3    0.994
#> 2       0.4    0.994

The threshold with the greatest accuracy using a LDA model uses a probability threshold of 0.3.

# tibble to save all predictions at the best threshold
out_best_lda = tibble()
out_all_preds = tibble()

 #Perform 10 fold cross validation
for(i in 1:10) {

  #Segement your data by fold using the which() function 
  testIndexes = which(folds==i,arr.ind=TRUE)
  testData = haitiBinarySqrsShuffled[testIndexes, ]
  trainData = haitiBinarySqrsShuffled[-testIndexes, ]
  
  train_lda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "lda", trControl = trctrl, tuneLength=10)
  
  # test with the fold's test data
  preds = predict(train_lda, newdata = testData, type="prob")
  
  #- evaluate fold k
  y_true = testData$ClassBinary
  y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.3, "Yes", "No")) %>% select(hat_value)

  #- output
  out_best_lda = bind_rows(out_best_lda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
  out_all_preds = bind_rows(out_all_preds, tibble(Yes = preds$Yes))
}
out_best_lda = out_best_lda %>% mutate(pred = factor(pred), truth = factor(truth))

caret::confusionMatrix(out_best_lda$pred, out_best_lda$truth)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    No   Yes
#>        No  61153   284
#>        Yes    66  1738
#>                                          
#>                Accuracy : 0.9945         
#>                  95% CI : (0.9939, 0.995)
#>     No Information Rate : 0.968          
#>     P-Value [Acc > NIR] : < 2.2e-16      
#>                                          
#>                   Kappa : 0.9057         
#>                                          
#>  Mcnemar's Test P-Value : < 2.2e-16      
#>                                          
#>             Sensitivity : 0.9989         
#>             Specificity : 0.8595         
#>          Pos Pred Value : 0.9954         
#>          Neg Pred Value : 0.9634         
#>              Prevalence : 0.9680         
#>          Detection Rate : 0.9670         
#>    Detection Prevalence : 0.9715         
#>       Balanced Accuracy : 0.9292         
#>                                          
#>        'Positive' Class : No             
#> 
For LDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
  • Accuracy: 0.9945
  • TPR: 1738 / (1738 + 284): 0.859
  • FPR: 66 / (61153 + 66): 0.001
  • Precision: 0.9954

4.4.1 LDA Training ROC Curve:

function_ROC_AUC(out_all_preds$Yes, out_best_lda$truth, "ROC For Tuned LDA Model")

#> [1] 0.9944908

The LDA AUC based on 10-fold cross-validated training data with a threshold of 0.3 is: 0.99.

4.5 QDA

I trained the QDA model using 10-fold cross validation.

10-fold cross validation was used and various thresholds were evaluated for best accuracy. 0.8 produced the highest accuracy. For illustrative purposes I only test thresholds of 0.7, 0.8, an 0.9.

set.seed(1976)

# I will use the same folds calculated in the LDA code# 

trctrl <- trainControl(method = "cv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T)

out_qda_p = tibble()
# TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
#for (p in c(.3, .4)){


   #Perform 10 fold cross validation
for(i in 1:10) {
  
    #Segement your data by fold using the which() function 
    testIndexes = which(folds==i,arr.ind=TRUE)
    testData = haitiBinarySqrsShuffled[testIndexes, ]
    trainData = haitiBinarySqrsShuffled[-testIndexes, ]
    
    train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
    
    # test with the fold's test data
    preds = predict(train_qda, newdata = testData, type="prob")
    
    #- evaluate fold k
    for (p in c(.7, .8, .9)){
      y_true = testData$ClassBinary
      # check the threshold
      y_hat = preds %>% mutate(hat_value = ifelse(Yes > p, "Yes", "No")) %>% select(hat_value)
      acc = apply(y_hat, 2, function(y) mean(y == y_true))
      
        #- output
      out_qda_p = bind_rows(out_qda_p, tibble(threshold = p, accuracy = acc))
    }
}
#-- Get mean accuracy
perf_qda_p = out_qda_p %>% 
  group_by(threshold) %>% 
  summarize(mean_acc = mean(accuracy)) 

perf_qda_p %>% arrange(-mean_acc)
#> # A tibble: 3 x 2
#>   threshold mean_acc
#>       <dbl>    <dbl>
#> 1       0.8    0.995
#> 2       0.9    0.995
#> 3       0.7    0.995

Based on the highest Mean Accuracy of cross validation of the training data, I am selecting a threshold of 0.8 for the QDA model.

# tibble to save all predictions at the best threshold
out_best_qda = tibble()
out_all_preds_qda = tibble()

 #Perform 10 fold cross validation
for(i in 1:10) {

  #Segement your data by fold using the which() function 
  testIndexes = which(folds==i,arr.ind=TRUE)
  testData = haitiBinarySqrsShuffled[testIndexes, ]
  trainData = haitiBinarySqrsShuffled[-testIndexes, ]
  
  train_qda <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "qda", trControl = trctrl, tuneLength=10)
  
  # test with the fold's test data
  preds = predict(train_qda, newdata = testData, type="prob")
  
  #- evaluate fold k
  y_true = testData$ClassBinary
  y_hat = preds %>% mutate(hat_value = ifelse(Yes > 0.8, "Yes", "No")) %>% select(hat_value)

  #- output
  out_best_qda = bind_rows(out_best_qda, tibble(pred = y_hat$hat_value, truth = testData$ClassBinary))
  out_all_preds_qda = bind_rows(out_all_preds_qda, tibble(Yes = preds$Yes))
}
out_best_qda = out_best_qda %>% mutate(pred = factor(pred), truth = factor(truth))

caret::confusionMatrix(out_best_qda$pred, out_best_qda$truth)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    No   Yes
#>        No  61149   254
#>        Yes    70  1768
#>                                           
#>                Accuracy : 0.9949          
#>                  95% CI : (0.9943, 0.9954)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9134          
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.9989          
#>             Specificity : 0.8744          
#>          Pos Pred Value : 0.9959          
#>          Neg Pred Value : 0.9619          
#>              Prevalence : 0.9680          
#>          Detection Rate : 0.9669          
#>    Detection Prevalence : 0.9709          
#>       Balanced Accuracy : 0.9366          
#>                                           
#>        'Positive' Class : No              
#> 
For QDA, my calculations for TPR, FPR, and Precision treat ‘Yes’, i.e. “yes, it is a blue tarp”, as the positive class.
  • Accuracy: 0.9949
  • TPR: 1768 / (1768 + 254): 0.874
  • FPR: 70 / (61149 + 70): 0.001
  • Precision: 0.9959

4.5.1 QDA Training ROC Curve:

function_ROC_AUC(out_all_preds_qda$Yes, out_best_qda$truth, "ROC For Tuned QDA Model")

#> [1] 0.997386

The QDA AUC based on 10-fold cross-validated training data with a threshold of 0.8 is: 0.997.

4.6 KNN

4.6.1 Tuning Parameter \(k\)

After testing over ranges from 1 to 51, the best k was 27. The code used is shown below, but not evaluated.

set.seed(1976)

trControl.knn <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, p = 0.5)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trControl.knn, tuneGrid = expand.grid(k = 26:28))
knn.cv.model
set.seed(1976)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 17:35))
knn.cv.model

I order to speed up this Rmd file I am no longer evaluating the following code chunk that evaluated k = 27 - 51. When this chunk is run, the best k was still 27.

set.seed(1976)

knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid   = expand.grid(k = 27:51))
knn.cv.model

4.6.1.1 Results of Tuning for K

I ran 10-fold cross-validation for several ranges of k:
  • 1 - 21: Returned best k == 17
  • 17 - 35: Returned best k == 27
  • 27 - 51: Returned best k == 27



From 1 - 51, the best k is 27. The tables of ROC, Sensitivity and Specificity were reviewed for each cross-validation training. From these tables one can see that the improvements within the range are in the hundredths of a percentage point of ROC, so k’s in the range of 10 - 51, appear reasonable selections for the cross-validated training data.

4.6.2 Tune the probability threshold, p, for KNN | k = 27

Running manual KNN cross validation to tune for p took significant time for my laptop to process. I ran with smaller lists of p just to get a feel for the results. I had previously ran with p = 0.5 and received accuracy of 0.996.

For processing speed time, I commented out the extending loop of threshold values, and limited the list of 0.4 and 0.5 for illustrative purposes.

library(class)

out_knn_p = tibble()

haitiBinarySqrsShuffled = haitiBinarySqrsShuffled %>% mutate(ClassBinaryTF = factor(if_else(ClassBinary == "No", FALSE, TRUE)))


#for (p in c(.1,.2,.3,.7,.8,.9)) 
#for (p in c(.2,.3,.4,.5,.6,.7,.8)) 

    for (j in 1:10)
    {
  
      #Segement your data by fold using the which() function 
      testIndexes = which(folds==j,arr.ind=TRUE)
      testData = haitiBinarySqrsShuffled[testIndexes, ]
      trainData = haitiBinarySqrsShuffled[-testIndexes, ] 

      train_knn <- train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = trainData, method = "knn", tuneGrid = data.frame(k=27))
  
      # test with the fold's test data
      preds = predict(train_knn, newdata = testData, type="prob")
      
      for (p in c(.4,.5)) 
      {
      
        #- evaluate fold k
        y_true = testData$ClassBinaryTF
        # set the threshold to p
        thres = p
        y_hat = tibble(preds$Yes > thres)
        acc = apply(y_hat, 2, function(y) mean(y == y_true))
      
          #- output
        out_knn_p = bind_rows(out_knn_p, tibble(threshold = p, accuracy = acc))
      }
}

out_knn_p
#> # A tibble: 20 x 2
#>    threshold accuracy
#>        <dbl>    <dbl>
#>  1       0.4    0.997
#>  2       0.5    0.997
#>  3       0.4    0.996
#>  4       0.5    0.997
#>  5       0.4    0.998
#>  6       0.5    0.998
#>  7       0.4    0.995
#>  8       0.5    0.996
#>  9       0.4    0.998
#> 10       0.5    0.997
#> 11       0.4    0.997
#> 12       0.5    0.997
#> 13       0.4    0.996
#> 14       0.5    0.996
#> 15       0.4    0.997
#> 16       0.5    0.997
#> 17       0.4    0.997
#> 18       0.5    0.997
#> 19       0.4    0.997
#> 20       0.5    0.997
#-- Get mean accuracy
perf_knn_p = out_knn_p %>% 
  group_by(threshold) %>% 
  summarize(mean_acc = mean(accuracy)) 

perf_knn_p %>% arrange(-mean_acc)
#> # A tibble: 2 x 2
#>   threshold mean_acc
#>       <dbl>    <dbl>
#> 1       0.5    0.997
#> 2       0.4    0.997

4.6.2.1 Results of Tuning for p:

After testing the following values of p: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, p == 0.5 is the value that produced the highest mean accuracy. This is convenient because caret’s cross-validation for KNN models default to a 0.5 threshold.

caret::confusionMatrix(knn.cv.model)
#> Error in caret::confusionMatrix(knn.cv.model): object 'knn.cv.model' not found

4.6.3 KNN k=27 Training ROC Curve:

result.knn = evalm(knn.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Error in evalm(knn.cv.model): object 'knn.cv.model' not found
result.knn$roc
#> Error in eval(expr, envir, enclos): object 'result.knn' not found
k = 27 1.0 N/A 0.997 0.998 0.0313 0.999

4.7 Penalized Logistic Regression (ElasticNet)

Subset selection using glmnet/ElasticNet provides an opportunity for me to see if additional predictors are of use improving the precision of our model.

I added additional terms to the model. I selected additional additive relations between the colors because RGB color is, by design, an additive color model. The interplay between Red, Blue, and Green are intuitively significant because the combinations of these values create the visible spectrum of color.

The predictors in the ElasticNet model are:
  • Red
  • Blue
  • Green
  • GBSqr = \((Green + Blue)^2\)
  • RBSqr = \((Red + Blue)^2\)
  • RGSqr = \((Red + Green)^2\)
  • RGBSqr = \((Red + Blue + Green)^2\)
haitiBinaryFull = haitiBinarySqrs %>%
  mutate(RGSqr = I(((Red + Green)^2)),  RGBSqr = I(((Red + Green + Blue)^2))) 
head(haitiBinaryFull)
#> # A tibble: 6 x 9
#>   Class        Red Green  Blue    GBSqr    RBSqr ClassBinary    RGSqr   RGBSqr
#>   <chr>      <dbl> <dbl> <dbl> <I<dbl>> <I<dbl>> <fct>       <I<dbl>> <I<dbl>>
#> 1 Vegetation    64    67    50     13.7     13.0 No             17161    32761
#> 2 Vegetation    64    67    50     13.7     13.0 No             17161    32761
#> 3 Vegetation    64    66    49     13.2     12.8 No             16900    32041
#> 4 Vegetation    75    82    53     18.2     16.4 No             24649    44100
#> 5 Vegetation    74    82    54     18.5     16.4 No             24336    44100
#> 6 Vegetation    72    76    52     16.4     15.4 No             21904    40000

I did not scale the predictors because the glmnet library will scale to standard deviation units. For the glmnet library, the training data must be in matrix format:

frmla = as.formula(ClassBinary~Red+Green+Blue+GBSqr+RBSqr+RGSqr+RGBSqr)

x.haitiFull.train = model.matrix(frmla, data = haitiBinaryFull)[,-1] # removing the intercept term from the formula

# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiFull.train = (haitiBinaryFull %>% mutate(ClassBinary = ifelse(ClassBinary == 'No', FALSE, TRUE)))$ClassBinary

Using penalized logistic regression (PLR), I evaluated three different PLR methods: ridge, lasso, and elasticnet (a combination of ridge and lasso). I tuned both lambda and the probability threshold (p). Based on best testing of the threshold tuning parameter, I considered p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95.

For the sake of processing speed I commented out the loop for the extended p values (0.2 was the best threshold when I considered the extended list), and kept 0.1, 0.2, 0.3 for illustrative purposes.

After testing ridge, elasticnet, and lasso, lasso performed the best. For speed of processing, I am only demonstrating the loop with a == 1 == lasso.

library(glmnet)

set.seed(1976)
# number of folds
K = 10
# make folds
folds = rep(1:K, length=nrow(x.haitiFull.train)) 

out = tibble()

# LOOP FOR alpha: 0 (ridge), 0.5 (elasticnet), and 1 (lasso)
#for (a in c(0, .5, 1)){

for (a in c(1)){
  # lambda may be different for the different PLR methods, so this is decided within the alpha loop
  #-- Get lambda sequence so consistent over all folds
  lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambda

  
  #-- Loop over K folds
  for(k in unique(folds)){
      
      #- Get train/test split for fold k
      ind.train = which(folds != k)
      ind.test = which(folds == k)
      
      #- fit the alpha model over all lambdas in lam_seq
      fit.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train], 
                                 alpha = a,          # use the alpha for the loop
                                 family="binomial",  # logistic regression
                                 lambda = lam_seq)   # all models in this alpha loop use same lambda sequence
      
      #- make predictions on test set fold k
      pred = predict(fit.model, x.haitiFull.train[ind.test, ], s=lam_seq, type = "response")
      
      #- evaluate fold k
      y_true = y.haitiFull.train[ind.test]
      
      # TO DO: ADD LOOP OVER p values of .1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95
       #for (p in c(.1, .2, .3, .4, .5, .6, .7, .8, .85, .9, .95)){
      for (p in c(.1, .2, .3)){
      # set the threshold to p
      thres = p
      y_hat = pred > thres
      acc = apply(y_hat, 2, function(y) mean(y == y_true))
      
      #- output
      out = bind_rows(out, 
                tibble(accuracy = acc, 
                       lambda = lam_seq,
                       n_train = length(ind.train),
                       n_test = length(ind.test),
                       alpha = a,
                       thres = p,
                       k = k))
      }
  }
}
out %>% filter(accuracy == max(accuracy))
#> # A tibble: 10 x 7
#>    accuracy    lambda n_train n_test alpha thres     k
#>       <dbl>     <dbl>   <int>  <int> <dbl> <dbl> <int>
#>  1    0.997 0.0000139   56917   6324     1   0.1     4
#>  2    0.997 0.0000127   56917   6324     1   0.1     4
#>  3    0.997 0.0000563   56917   6324     1   0.2     4
#>  4    0.997 0.0000513   56917   6324     1   0.2     4
#>  5    0.997 0.0000222   56917   6324     1   0.3    10
#>  6    0.997 0.0000202   56917   6324     1   0.3    10
#>  7    0.997 0.0000184   56917   6324     1   0.3    10
#>  8    0.997 0.0000168   56917   6324     1   0.3    10
#>  9    0.997 0.0000153   56917   6324     1   0.3    10
#> 10    0.997 0.0000139   56917   6324     1   0.3    10
#-- Get mean accuracy
perf = out %>% 
  group_by(alpha, thres, lambda) %>% 
  summarize(mean_acc = mean(accuracy), se_acc = sd(accuracy)/k) 
#> `summarise()` has grouped output by 'alpha', 'thres', 'lambda'. You can override using the `.groups` argument.
perf %>% arrange(-mean_acc)
#> # A tibble: 2,640 x 5
#> # Groups:   alpha, thres, lambda [264]
#>    alpha thres    lambda mean_acc    se_acc
#>    <dbl> <dbl>     <dbl>    <dbl>     <dbl>
#>  1     1   0.2 0.0000563    0.996 0.000467 
#>  2     1   0.2 0.0000563    0.996 0.000233 
#>  3     1   0.2 0.0000563    0.996 0.000156 
#>  4     1   0.2 0.0000563    0.996 0.000117 
#>  5     1   0.2 0.0000563    0.996 0.0000933
#>  6     1   0.2 0.0000563    0.996 0.0000778
#>  7     1   0.2 0.0000563    0.996 0.0000667
#>  8     1   0.2 0.0000563    0.996 0.0000583
#>  9     1   0.2 0.0000563    0.996 0.0000519
#> 10     1   0.2 0.0000563    0.996 0.0000467
#> # ... with 2,630 more rows
log(5.630136e-05)
#> [1] -9.784792

The plot below charts the change in mean accuracy as the log of lambda increases for each of the 10 folds. The red error bars display the difference between the (mean accuracy - standard error accuracy) and the (max accuracy + the standard error accuracy) - giving the range of mean accuracy for log lambda. The plot demonstrates that generally accuracy decreases, and the standard error of accuracy increases as log lambda increases. The highest mean accuracy is at a log lambda -9.78.

ggplot(perf, aes(log(lambda), mean_acc)) +
  geom_point() + geom_line() +
  geom_errorbar(aes(ymin=mean_acc-se_acc, ymax=mean_acc + se_acc), 
                color="red", alpha=1)

From the manual cross-validation testing of accuracy the PLR lasso method with lambda of 5.630136e-05 and a threshold of 0.2 produced the greatest mean accuracy: 0.996.

out_confusion = tibble()

 #-- Loop over K folds
for(k in unique(folds)){
  
  #- Get train/test split for fold k
  ind.train = which(folds != k)
  ind.test = which(folds == k)
  
  #- fit the lasso
  cv.glmnet.model = glmnet(x.haitiFull.train[ind.train,], y.haitiFull.train[ind.train], 
                             alpha = 1,          
                             family="binomial",  # logistic regression
                             lambda = lam_seq)   
  
  #- make predictions on test set fold k
  pred = predict(cv.glmnet.model, x.haitiFull.train[ind.test, ], s=0.00005630136, type = "response")
  
  #- evaluate fold k
  y_true = y.haitiFull.train[ind.test]
  # set the threshold to p
  thres = 0.2
  y_hat = pred > thres

    #- output
  out_confusion = bind_rows(out_confusion, 
            tibble(truth = y_true, 
                   glmnet.fitted = y_hat))
}
#> Error in glmnet(x.haitiFull.train[ind.train, ], y.haitiFull.train[ind.train], : could not find function "glmnet"

The optimal lambda:

coef(cv.glmnet.model, 0.00005630136)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'coef': object 'cv.glmnet.model' not found

This resulting model is interesting. Cross-validated lasso, with p = 0.2 set the coefficients for GBSqr, RBSqr and RGBSqr to 0. This could be due to collinearity with Blue, or because the variables are truly not significant to the accuracy of the model. This did determine that the new variable RGSqr was significant.

out_confusion %>% 
  mutate(truth = factor(truth), glmnet.fitted = factor(glmnet.fitted)) %>%
  conf_mat(truth, glmnet.fitted)
#> Error in conf_mat(., truth, glmnet.fitted): could not find function "conf_mat"
assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family="binomial")
#> Error in assess.glmnet(pred, newy = y.haitiFull.train[ind.test], family = "binomial"): could not find function "assess.glmnet"

AUC for the lasso PLR model with a lambda of 0.00005630136 is 0.9996.

Using lasso penalized logistic regression with a threshold = 0.2, lambda = 0.00005630136:
  • Accuracy: (1941 + 61046) / 63241: 0.996
  • TPR: 1941 / 2022 = 0.96
  • FPR: 173 / 61219 = 0.0028
  • Precision: 1941 / 2114 = 0.918

I was curious to examine the performance of the model selected by lasso in a logistic regression model with a threshold also selected by lasso (0.2). This is the selected PLR model without the lambda penalty term.

set.seed(1976)
#Randomly shuffle the data
haitiBinarySqrsFullShuffled = haitiBinaryFull[sample(nrow(haitiBinaryFull)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiBinarySqrsFullShuffled)),breaks=10,labels=FALSE)

out_lr.full = tibble()

#Perform 10 fold cross validation
for(i in 1:10) {
  
    #Segement your data by fold using the which() function 
    testIndexes = which(folds==i,arr.ind=TRUE)
    testData = haitiBinarySqrsFullShuffled[testIndexes, ]
    trainData = haitiBinarySqrsFullShuffled[-testIndexes, ]
    
    # define complex model
    glm.fits.full = glm(ClassBinary ~ Blue+Green+Red+RGSqr, binomial, data = trainData)

    # fit the glmnet lasso model
    glm.pred.full = glm.fits.full %>%  augment(newdata=testData) %>% 
      dplyr::select(ClassBinary, .fitted)  %>% 
      mutate(ClassBinary=factor(ClassBinary))%>%
      mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .2, "No", "Yes")) %>%
      mutate(.prediction=fct_relevel(as_factor(.prediction), c("No", "Yes"))) 
    
    out_lr.full = bind_rows(out_lr.full, 
            tibble(truth = testData$ClassBinary, 
                   prediction = glm.pred.full$.prediction,
                   fitted = glm.pred.full$.fitted))
    
}

caret::confusionMatrix(out_lr.full$prediction, out_lr.full$truth)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    No   Yes
#>        No  61039    73
#>        Yes   180  1949
#>                                           
#>                Accuracy : 0.996           
#>                  95% CI : (0.9955, 0.9965)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.937           
#>                                           
#>  Mcnemar's Test P-Value : 2.662e-11       
#>                                           
#>             Sensitivity : 0.9971          
#>             Specificity : 0.9639          
#>          Pos Pred Value : 0.9988          
#>          Neg Pred Value : 0.9155          
#>              Prevalence : 0.9680          
#>          Detection Rate : 0.9652          
#>    Detection Prevalence : 0.9663          
#>       Balanced Accuracy : 0.9805          
#>                                           
#>        'Positive' Class : No              
#> 
function_ROC_AUC(out_lr.full$fitted, out_lr.full$truth, "ROC For Tuned Lasso Model")
#> Error in prediction(fitted, truth): could not find function "prediction"
Using logistic regression with a threshold = 0.2 and the model selected by lasso:
  • Accuracy: 0.996
  • AUC: 0.9995
  • TPR: 1949 / 2022 = 0.964
  • FPR: 180 / 61219 = 0.003
  • Precision: 1949 / 2129 = 0.915
  • Neg Pred Value: 61039 / 61112 = 0.999
  • False Pred Rate: 73 / 2022 = 0.036

It was interesting that this model performed slightly better on Precision, and FPR, but not as well based on other measures as the logistic regression model Red + Green + Blue + RBSqr + GBSqr.

4.8 Random Forest

For the Random Forrest I want to include several new predictors using combinations of Red, Blue, and Green:
  • Red
  • Blue
  • Green
  • RBSqr = (Red + Blue)^2
  • GBSqr = (Green + Blue)^2
  • RGSqr = (Red + Green)^2
  • RGBSqr = (Red + Blue + Green)^2
  • RG = (Red * Green)
  • RB = (Red * Blue)
  • GB = (Green * Blue)
haitiBinaryRF = haitiBinarySqrs %>%
  mutate(RGSqr = I(((Red + Green)^2)),  RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue)) 

head(haitiBinaryRF)
#> # A tibble: 6 x 12
#>   Class   Red Green  Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr    RG    RB    GB
#>   <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct>       <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~    64    67    50  13.7  13.0 No          17161  32761  4288  3200  3350
#> 2 Vege~    64    67    50  13.7  13.0 No          17161  32761  4288  3200  3350
#> 3 Vege~    64    66    49  13.2  12.8 No          16900  32041  4224  3136  3234
#> 4 Vege~    75    82    53  18.2  16.4 No          24649  44100  6150  3975  4346
#> 5 Vege~    74    82    54  18.5  16.4 No          24336  44100  6068  3996  4428
#> 6 Vege~    72    76    52  16.4  15.4 No          21904  40000  5472  3744  3952

The first random forest tuning parameter I will test for is number of trees and mtry. I will test ntree values by cross validation.

I did run the ntree loop for 250, 500, and 1000 trees. The best performing number of trees was 1000.

I tuned with a list of mtry values of 3,4,5,6,7,8,9 and 10. 4 was the best performing. In order to lessen the amount of timed needed to knit this Rmd I am not evaluating the following code block.

control = trainControl(method = 'repeatedcv', number = 10, repeats = 3, search = 'grid')

# possible mtry values
tunegrid = expand.grid(.mtry = c(3,4,5,6,7,8,9,10))
modellist = list()

#train with different ntree parameters
#for (ntree in c(250,500,1000)){
for (ntree in c(1000)){
  set.seed(123)
  rf.fit = train(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB,
               data = haitiBinaryRF,
               method = 'rf',
               metric = 'Accuracy',
               tuneGrid = tunegrid,
               trControl = control,
               ntree = ntree)

  modellist[[toString(ntree)]] = rf.fit
}

# look at the results of the cross validation with different ntree and mtry values
results = resamples(modellist)
summary(results)
rf.fit$bestTune
summary(rf.fit$finalModel$ntree)
as.data.frame(rf.fit$finalModel$importance) %>% arrange(MeanDecreaseGini)

The above code took such a long time to run, I am not longer evaluating in my Rmd. Here are screen shots of the output to support my further evaluation of the selected random forest model:

RF Tuning Reuslts

RF MTRY Reuslts

RF Importance Results

Random forest cross validation selected 1000 trees, with 4 predictors:
  1. Blue
  2. Read
  3. GB = (Green * Blue)
  4. RG = (Red * Blue)

Next, I need to test for the best probability threshold using the tuned tree with number of predictors considered at each split == 4, and number of trees = 1000. I will use accuracy for selecting the best p value.

set.seed(123)
#Randomly shuffle the data
haitiRFShuffled = haitiBinaryRF[sample(nrow(haitiBinarySqrs)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)

head(haitiRFShuffled)
#> # A tibble: 6 x 12
#>   Class   Red Green  Blue GBSqr RBSqr ClassBinary  RGSqr RGBSqr    RG    RB
#>   <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct>       <I<db> <I<db> <dbl> <dbl>
#> 1 Roof~   240   229   205 188.  198.  No          219961 454276 54960 49200
#> 2 Vari~    91    77    66  20.4  24.6 No           28224  54756  7007  6006
#> 3 Vege~    70    69    51  14.4  14.6 No           19321  36100  4830  3570
#> 4 Vari~   127   110    96  42.4  49.7 No           56169 110889 13970 12192
#> 5 Soil    255   255   218 224.  224.  No          260100 529984 65025 55590
#> 6 Soil    255   233   169 162.  180.  No          238144 431649 59415 43095
#> # ... with 1 more variable: GB <dbl>

I did run this loop with threshold 0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 and 0.7 performed the best.

For the sake of faster compilation time, I am limited the threshold loop to 0.6, and 0.7 to illustrate the process.

# create storage variables for the p value, Accuracy, ROC, Specificity, and Sensitivity
all_out_yhat_ytrue = tibble()
out_yhat_ytrue = tibble()
out_rf = tibble()

  # reset the fold accumulator tibble
  out_yhat_ytrue = tibble()

  #Perform 10 fold cross validation
for(i in 1:10) {
    
      #Segment data by fold using the which() function 
      testIndexes = which(folds==i,arr.ind=TRUE)
      testData = haitiRFShuffled[testIndexes, ]
      trainingData = haitiRFShuffled[-testIndexes, ]
      
      #small training and test sets
      # comment this out to get real values
      #testData = testData[1:100,]
      #trainingData= head(trainingData, 500)
      #end of small training and test sets

      # training
      rf.haiti = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = trainingData, mtry=4, ntree=1000)
      # test
      rf.preds = predict(rf.haiti, newdata = testData, type="prob")
      
      #- evaluate fold
      y_true = testData %>% mutate(ClassBinary = (ClassBinary == "Yes"), Class0or1 = ifelse(ClassBinary == FALSE,0,1))
      y_true_val = y_true$ClassBinary
      
      #y_true
      
      #for (j in c(.2,.3,.4,.5,.6,.7,.8,.9))
      for (j in c(.6,.7)) { 
      
        # set the threshold to p
        y_hat = rf.preds[,2] > j
        #y_hat
        
        #- output
        out_yhat_ytrue = bind_rows(out_yhat_ytrue, 
                tibble(thres = j, true = y_true_val, hat = y_hat, truth0or1 = y_true$Class0or1, X0 = rf.preds[,1], X1 = rf.preds[,2]))
        
        all_out_yhat_ytrue = bind_rows(all_out_yhat_ytrue,
                tibble(thres = j, true = y_true_val, hat = y_hat))
        
        acc = sum(out_yhat_ytrue$hat == out_yhat_ytrue$true) / nrow(out_yhat_ytrue)
        
        out_rf = bind_rows(out_rf, 
            tibble(accuracy = acc, thres = j))
      }
}
#out_rf

out_rf %>%
  group_by(thres) %>%
  summarize_at(vars(accuracy), list(mean_acc=mean))
#> # A tibble: 2 x 2
#>   thres mean_acc
#> * <dbl>    <dbl>
#> 1   0.6    0.997
#> 2   0.7    0.997
Random forest cross validation with:
  • ntree = 1000
  • mtry = 4
  • threshold = 0.7

These tuning parameters produced the best accuracy.

Next, I produce the confusion matrix and ROC curve

# filter only the true and hat values where thres == 0.3
best_rf_df = all_out_yhat_ytrue %>% 
  mutate(factortrue = factor(true), factorhat = factor(hat)) %>% 
  filter(thres == 0.7)

head(best_rf_df)
#> # A tibble: 6 x 5
#>   thres true  hat   factortrue factorhat
#>   <dbl> <lgl> <lgl> <fct>      <fct>    
#> 1   0.7 FALSE FALSE FALSE      FALSE    
#> 2   0.7 FALSE FALSE FALSE      FALSE    
#> 3   0.7 FALSE FALSE FALSE      FALSE    
#> 4   0.7 FALSE FALSE FALSE      FALSE    
#> 5   0.7 FALSE FALSE FALSE      FALSE    
#> 6   0.7 FALSE FALSE FALSE      FALSE
caret::confusionMatrix(best_rf_df$factorhat, best_rf_df$factortrue)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE  TRUE
#>      FALSE 61174   164
#>      TRUE     45  1858
#>                                           
#>                Accuracy : 0.9967          
#>                  95% CI : (0.9962, 0.9971)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.945           
#>                                           
#>  Mcnemar's Test P-Value : 3.289e-16       
#>                                           
#>             Sensitivity : 0.9993          
#>             Specificity : 0.9189          
#>          Pos Pred Value : 0.9973          
#>          Neg Pred Value : 0.9764          
#>              Prevalence : 0.9680          
#>          Detection Rate : 0.9673          
#>    Detection Prevalence : 0.9699          
#>       Balanced Accuracy : 0.9591          
#>                                           
#>        'Positive' Class : FALSE           
#> 

Produce the ROC and calculate the AUC for the tuned random forest model:

function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$truth0or1, "ROC For Tuned RF Model")

#> [1] 0.9942587

4.9 SVM

For the support vector machine model I will use the same variables as the random forest model:
  • Red
  • Blue
  • Green
  • RBSqr = (Red + Blue)^2
  • GBSqr = (Green + Blue)^2
  • RGSqr = (Red + Green)^2
  • RGBSqr = (Red + Blue + Green)^2
  • RG = (Red * Green)
  • RB = (Red * Blue)
  • GB = (Green * Blue)

4.9.1 Tuning the Support Vector Machine

Based on the 3D visualization of the dataset I generated earlier, I believe the data will separate well with a linear kernel. Using cross validation, I will tune a linear kernel with cost values that include 0.001, 0.01, 0.1, 1, 10, and 100.

I ran the svm with possible costs including 0.01, 0.1, 1, and 10 and probability thresholds .2, .3, .4, .5, .6, .7, and .8. The best performing model used cost == 1 and threshold == 0.2. For speed of knitting this Rmd I only include those values in the code below:

out_svm = tibble()
out_svm_accuracy = tibble()

#Perform 10 fold cross validation with various costs

#Create 10 equally size folds
set.seed(1976)
folds <- cut(seq(1,nrow(haitiRFShuffled)),breaks=10,labels=FALSE)


#for (cost_val in c( 0.01, 0.1, 1, 10))
for (cost_val in c(1))
{
  for(i in 1:10)
  {
      
      # Use my previously generated folds
      #Segement your data by fold using the which() function 
      testIndexes = which(folds==i,arr.ind=TRUE)
      testData = haitiRFShuffled[testIndexes, ]
      trainData = haitiRFShuffled[-testIndexes, ]
  
      svm_model = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=trainData, kernel="linear", ranges=list(cost=c(cost_val)), probability=TRUE)
      
      pred = predict(svm_model$best.model, newdata = testData, probability=TRUE)
      
      # to speed this up, loop through the preds after the svm has been fit.
      #for (thres in c(.2,.3,.4,.5,.6,.7,.8))
      for (thres in c(.2))
      {
          y_hat = attr(pred, "probabilities")[,"Yes"] > thres
          
          acc = sum(y_hat == (testData$ClassBinary == "1")) / nrow(testData)
          
          out_svm = bind_rows(out_svm, 
                tibble(thres = thres, cost = cost_val, pred_value = y_hat, truth_value = testData$ClassBinary, accuracy = acc, noProb = attr(pred, "probabilities")[,"No"], yesProb = attr(pred, "probabilities")[,"Yes"]))
      }
  }
}
out_svm %>% group_by(thres, cost) %>% summarize(acc = mean(accuracy)) %>% arrange(desc(acc))
#> `summarise()` has grouped output by 'thres'. You can override using the `.groups` argument.
#> # A tibble: 1 x 3
#> # Groups:   thres [1]
#>   thres  cost   acc
#>   <dbl> <dbl> <dbl>
#> 1   0.2     1 0.967

The highest performing cross-validated svm used a cost of 1 and a threshold of 0.2.

best_svm = out_svm %>% filter(thres == 0.2 & cost == 1.00) %>% mutate(truth_value = (truth_value == "Yes")) %>% mutate(pred_value = factor(pred_value), truth_value = factor(truth_value))
head(best_svm)
#> # A tibble: 6 x 7
#>   thres  cost pred_value truth_value accuracy noProb     yesProb
#>   <dbl> <dbl> <fct>      <fct>          <dbl>  <dbl>       <dbl>
#> 1   0.2     1 FALSE      FALSE          0.966   1.00 0.000000100
#> 2   0.2     1 FALSE      FALSE          0.966   1.00 0.000348   
#> 3   0.2     1 FALSE      FALSE          0.966   1.00 0.000351   
#> 4   0.2     1 FALSE      FALSE          0.966   1.00 0.000103   
#> 5   0.2     1 FALSE      FALSE          0.966   1.00 0.000000100
#> 6   0.2     1 FALSE      FALSE          0.966   1.00 0.000000100
caret::confusionMatrix(best_svm$pred_value, best_svm$truth_value)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE  TRUE
#>      FALSE 61063    75
#>      TRUE    156  1947
#>                                           
#>                Accuracy : 0.9963          
#>                  95% CI : (0.9958, 0.9968)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9421          
#>                                           
#>  Mcnemar's Test P-Value : 1.413e-07       
#>                                           
#>             Sensitivity : 0.9975          
#>             Specificity : 0.9629          
#>          Pos Pred Value : 0.9988          
#>          Neg Pred Value : 0.9258          
#>              Prevalence : 0.9680          
#>          Detection Rate : 0.9656          
#>    Detection Prevalence : 0.9667          
#>       Balanced Accuracy : 0.9802          
#>                                           
#>        'Positive' Class : FALSE           
#> 
function_ROC_AUC(best_svm$yesProb, best_svm$truth_value, "ROC Training SVM")
#> Error in prediction(fitted, truth): could not find function "prediction"
Using SVM with a threshold = 0.2 and a cost of 1.0:
  • Accuracy: 0.9963
  • AUC: 0.9995
  • TPR: 1947 / 2022 = 0.962
  • FPR: 156 / 61219 = 0.0025
  • Precision: 1947 / 2100 = 0.9262
  • Neg Pred Value: 61064 / 61141 = 0.999
  • False Pred Rate: 75 / 2022 = 0.038

5 Results (Cross-Validation)

Model Tuning AUROC Threshold Accuracy TPR FPR Precision
Log Reg N/A 0.9996 0.1 0.9957 0.9802 0.0038 0.896
LDA N/A 0.9945 0.3 0.9945 0.859 0.001 0.9954
QDA N/A 0.997 0.8 0.9949 0.874 0.001 0.9959
KNN k = 27 1.0 0.5 0.997 0.998 0.0313 0.999
Pen Log Reg alpha = 1, lambda = 5.630136e-05 0.9996 0.2 0.996 0.959 0.0028 0.918
Random Forest mtry = 4, ntree = 1000 0.994 0.6 0.997 0.932 0.0009 0.9706
SVM Cost = 1.0 0.9995 0.2 0.9963 0.962 0.0025 0.9262

6 Conclusions

6.1 Conclusion #1

Training Data Linearly Separates, but Model is More Than Just Blue

Visually, the training data linearly separates very well. Even the 5-class, un-transformed data set was visually separable in the 3-D R-G-B visualization of “Blue Tarps” vs. the other four classes. Collapsing the classes and transforming the variables to decrease variability of Red and Green further improved the linear separability. All the models used performed with 99% accuracy with cross validation using the training data. All except the Lasso model used the following formula:
\[ ClassBinary = Blue+Green+Red+GBSqr+RBSqr\] where \(GBSqr=(Green + Blue)^2\) and \(RBSqr=(Red + Blue)^2\)

The lasso model selected the formula: \[ClassBinary = Red + Green + Blue + (Red + Green)^2\]
The functions with the additive sqare terms performed better than the simplest model (ClassBinary = Blue), and slightly better than the basic additive model (ClassBinary = Blue + Red + Green).

Even though a pixel with only a blue tarp in it is easily identifiable, pixels partially representing blue tarps require more predictors to perform better than guessing.

I look forward to discovering if the hold-out/validation data set also separates as well.

6.2 Conclusion #2

Distribution of Classes and How It Affects Model Selection

Only 3% of the observations in the training data are labeled “Blue Tarp”. This is a very unbalanced training set. This is not unexpected because it would be extremely surprising if a high percentage of land area was covered by blue tarps. In fact, if that were the case, then I would expect it would not be challenging for aid workers to find survivors because survivors would be covering a high percentage of Haiti, or Haiti would just have blue tarps everywhere and they would be of little predictive power for finding survivors.

99% accuracy, while impressive sounding, needs to be considered within context. Accuracy is 97% if the model predicted “No” for every pixel; however no blue tarps would be identified and survivors would remain undiscovered by aid workers. The “best guess” scenario, with high accuracy, provides no value for the humans in our toy problem. 99% accuracy is better than guessing, and the closer to 100% accuracy we can get, the better.

That said, this imbalance of the data can give a false sense of faith in a model if only evaluating specificity, aka True Negative Rate: non-blue tarps. In this toy problem (as stated on Conclusion #1) a random guess of “not blue tarp” will be correct 97% of the time, thus our model must perform better than a random guess, i.e. the Negative Predictive Value (NPV). In this problem the NPV is a measure of efficiency of resource usage, i.e. are the predictions good at keeping our imagined aid workers from traveling to non blue tarps.

The NPV’s for the models are:
  • Logistic Regression Model with Sqr Terms: 0.9993
  • LDA with Complex Model: 0.9949
  • QDA with Complex Model: 0.9969
  • KNN | k = 27 with Complex Model: 0.999
  • Penalized Log Reg with Complex Model: 0.9987

Additionally, we do not want to miss blue tarps. Missing a blue tarp may result in the loss of multiple lives. The False Negative Rate (FNR) directly measures the rate of missed blue tarps.
  • Logistic Regression Model with Sqr Terms: 0.0198
  • LDA with Complex Model: 0.1563
  • QDA with Complex Model: 0.0938
  • KNN | k = 27 with Complex Model: 0.0313
  • Penalized Log Reg with Complex Model: 0.041


Before selecting any model, I recommend working with experienced aid workers and local Haitian residents with knowledge of the land to determine the choice between, and balance of these considerations. However, in this case, if we want the model that is least likely to send aid workers to non-blue tarps we will select the tuned Logistic Regression Model WIth Sqr Terms; additionally if we want the model that correctly identified the greatest percentage of blue tarps we will also select the tuned Logistic Regression Model with Sqr Terms.

The Logistic Regression Model with Sqr Terms, while lower in Precision, performed better on other measures which are more appropriate for this problem.

6.3 Conclusion #3

If time is of the essence, KNN Took a Long Time to Tune on my Laptop: Of all the models, tuning KNN for both the highest accuracy “number of neighbors” and highest accuracy probability threshold took the most time for my laptop to complete. In fact, on my laptop, I started the threshold tuning loop and went and did chores while periodically checking for completion. I am assuming that in a real-world scenario, each night the model would be validated and tuned with additional labeled data to improve performance. Adding to this concern, I only tested neighbors in the range 1 to 51, and with 60,000+ observations a larger k may have performed better, but the time needed to test additional k tuning parameters was too great.

If time is of the essence, which in this scenario it would be because getting predictions for locations of survivors is only the first step in making a plan for volunteers to reach the survivors, it may make sense to select a model that does not take as long to produce a result. The KNN model performed almost as well as the Logistic Regression with the complex model when considering Negative Predictive Value, and False Negative Rate; however the Logistic Regression Model performed better on these two metrics. However, the KNN model performed better at AUROC, and Precision than the Logistic Regression model, but with a significant increase in time to compute.

6.4 Conclusion #4

Very small penalty for additional parameters:

The penalty (\(\lambda\)) that performed the best using elasticnet penalized regression was very small. The conclusion I draw is that while lasso performed the best as accuracy when comparing ridge, elasticnet, and lasso, the penalties for the additional predictors was very small and with such a small ratio of number of parameters to number of observations, it wasn’t significantly useful, in this case, to shrink the model.

6.5 Conclusion #5

Request For Additional Information for Real-World Implementation

I am interested in how to use limited resources to reach the greatest number of vulnerable people. Without the GIS information for each pixel I am unable to calculate locations where the most blue tarps will be found to able to help the most people; however, there may not be a 1:1 relationship between number of tarps and number of people, so resource allocation becomes even trickier. My conclusion is that without the added GIS information for the observations I cannot provide information to help an efficient distribution of aid to those affected by the earthquake. And, demographic information regarding population density, and members-per-household would be of priceless value. Additionally, GIS information for pre-earthquake roads, and other geographic features would be useful.

7 Hold-out Data / EDA

7.1 Load the Hold-out Data

The following code is not evaluated in the knitted document, but is included here for completeness.

# directory
holdout_dir = "HoldOut"

# unzip
unzip(file.path(holdout_dir,"Hold+Out+Data.zip"), exdir = holdout_dir)

7.1.1 Not Blue Tarps

File 1
  • Name: orthovnir057_ROI_NON_Blue_Tarps.txt
  • NOT Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"), n_max=12)
  • 8th row is header
  • B1 == Red
  • B2 == Blue
  • B3 == Green
df_NotBlue_057 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir057_ROI_NON_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '0') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_NotBlue_057)

dim(df_NotBlue_057)
File 2
  • Name: orthovnir067_ROI_NOT_Blue_Tarps.txt
  • NOT Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"), n_max=12)
df_NotBlue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_NOT_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '0') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_NotBlue_067)

dim(df_NotBlue_067)
File 3
  • Name: orthovnir069_ROI_NOT_Blue_Tarps.txt
  • NOT Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"), n_max=12)
df_NotBlue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_NOT_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '0') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_NotBlue_069)

dim(df_NotBlue_069)
File 4
  • Name: orthovnir078_ROI_NON_Blue_Tarps.txt
  • NOT Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"), n_max=12)
df_NotBlue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_NON_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '0') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_NotBlue_078)

dim(df_NotBlue_078)

7.1.2 Blue Tarps

File 1
  • Name: orthovnir067_ROI_Blue_Tarps.txt
  • Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"), n_max=12)
df_Blue_067 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '1') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)
#> Error in file.path(holdout_dir, "orthovnir067_ROI_Blue_Tarps.txt"): object 'holdout_dir' not found
head(df_Blue_067)
#> Error in head(df_Blue_067): object 'df_Blue_067' not found
dim(df_Blue_067)
#> Error in eval(expr, envir, enclos): object 'df_Blue_067' not found
File 2
  • Name: orthovnir069_ROI_Blue_Tarps.txt
  • Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"), n_max=12)
df_Blue_069 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir069_ROI_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '1') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_Blue_069)

dim(df_Blue_069)
File 3
  • Name: orthovnir078_ROI_Blue_Tarps.txt
  • Blue Tarp
read_lines(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"), n_max=12)
df_Blue_078 = as.data.frame(read_table(file.path(holdout_dir, "orthovnir078_ROI_Blue_Tarps.txt"), 
                               skip=7, col_types = cols(';'="-", ID = "c"))) %>%
                              rename(Red = B1, Green = B2, Blue = B3) %>%
                              mutate(ClassBinary = '1') %>%
                              select(Lat, Lon, Red, Green, Blue, ClassBinary)

head(df_Blue_078)

dim(df_Blue_078)
# join the 7 dataframes of hold out data
df_HoldOut = dplyr::bind_rows(df_NotBlue_057, df_NotBlue_067, df_NotBlue_069, df_NotBlue_078, df_Blue_067, df_Blue_069, df_Blue_078)

dim(df_HoldOut)
head(df_HoldOut)
library(readr)

write_csv(df_HoldOut, "holdout.csv")

7.1.3 Entire training data set

Confirm we have the entire training data set for training each model with the tuned parameters

haitiAllTraining = haitiBinaryRF %>%
  mutate(ClassBinary = ifelse(ClassBinary == "No","0","1")) %>%
  mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo))

head(haitiAllTraining)
#> # A tibble: 6 x 13
#>   Class   Red Green  Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr    RG    RB    GB
#>   <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <chr>       <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~    64    67    50  13.7  13.0 0           17161  32761  4288  3200  3350
#> 2 Vege~    64    67    50  13.7  13.0 0           17161  32761  4288  3200  3350
#> 3 Vege~    64    66    49  13.2  12.8 0           16900  32041  4224  3136  3234
#> 4 Vege~    75    82    53  18.2  16.4 0           24649  44100  6150  3975  4346
#> 5 Vege~    74    82    54  18.5  16.4 0           24336  44100  6068  3996  4428
#> 6 Vege~    72    76    52  16.4  15.4 0           21904  40000  5472  3744  3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>

7.1.4 Load holdout.csv into a dataframe

filename_ho = 'holdout.csv'
haiti_ho <- read_csv(filename_ho)
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   Lat = col_double(),
#>   Lon = col_double(),
#>   Red = col_double(),
#>   Green = col_double(),
#>   Blue = col_double(),
#>   ClassBinary = col_double()
#> )
dim(haiti_ho)
#> [1] 2004177       6

Next, I apply the transformations used on the testing data:

haiti_ho_full = haiti_ho %>%
  mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = factor(ClassBinary)) %>%
  mutate(RGSqr = I(((Red + Green)^2)),  RGBSqr = I(((Red + Green + Blue)^2)), RG = (Red * Green), RB = (Red * Blue), GB = (Green * Blue)) %>%
  mutate(ClassBinaryYesNo = ifelse(ClassBinary == 0, "No","Yes")) %>% mutate(ClassBinaryYesNo = factor(ClassBinaryYesNo)) %>%
  mutate(ClassBinary = factor(ClassBinary))

head(haiti_ho_full)  
#> # A tibble: 6 x 14
#>     Lat   Lon   Red Green  Blue ClassBinary GBSqr RBSqr RGSqr RGBSqr    RG    RB
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <fct>       <I<d> <I<d> <I<d> <I<db> <dbl> <dbl>
#> 1  18.5 -72.4   104    89    63 0            23.1  27.9 37249  65536  9256  6552
#> 2  18.5 -72.4   101    80    60 0            19.6  25.9 32761  58081  8080  6060
#> 3  18.5 -72.4   103    87    69 0            24.3  29.6 36100  67081  8961  7107
#> 4  18.5 -72.4   107    93    72 0            27.2  32.0 40000  73984  9951  7704
#> 5  18.5 -72.4   109    99    68 0            27.9  31.3 43264  76176 10791  7412
#> 6  18.5 -72.4   103    73    53 0            15.9  24.3 30976  52441  7519  5459
#> # ... with 2 more variables: GB <dbl>, ClassBinaryYesNo <fct>
haiti_ho_full %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#>   ClassBinary       N  Perc
#> * <fct>         <int> <dbl>
#> 1 0           1989697    99
#> 2 1             14480     1

Our test data is only 1% Blue Tarps, which is a differente distribution from our training data.

8 Results (Hold-Out)

8.1 Hold-out Results Logistic Regression

# train the selected logistic regression model with the complete training data set
glm.all.train = glm(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, binomial, data = haitiAllTraining)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.preds = glm.all.train %>%  augment(newdata=haiti_ho_full) %>% 
      mutate(.prediction=ifelse(1 - 1/(1 + exp(.fitted)) < .1, "0", "1")) %>%
      mutate(.prediction = factor(.prediction))



caret::confusionMatrix(glm.preds$.prediction, haiti_ho_full$ClassBinary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction       0       1
#>          0 1966055      89
#>          1   23642   14391
#>                                          
#>                Accuracy : 0.9882         
#>                  95% CI : (0.988, 0.9883)
#>     No Information Rate : 0.9928         
#>     P-Value [Acc > NIR] : 1              
#>                                          
#>                   Kappa : 0.5433         
#>                                          
#>  Mcnemar's Test P-Value : <2e-16         
#>                                          
#>             Sensitivity : 0.9881         
#>             Specificity : 0.9939         
#>          Pos Pred Value : 1.0000         
#>          Neg Pred Value : 0.3784         
#>              Prevalence : 0.9928         
#>          Detection Rate : 0.9810         
#>    Detection Prevalence : 0.9810         
#>       Balanced Accuracy : 0.9910         
#>                                          
#>        'Positive' Class : 0              
#> 

For Logistic Regression, the hold-out test results with the tuned threshold of 0.1 are:

  • Accuracy: 0.9882
  • TPR: 14391 / (14480) = 0.994
  • FPR: 23642 / (1989697) = 0.012
  • Precision: 14391 / (38033) = 0.378
function_ROC_AUC(glm.preds$.fitted, glm.preds$ClassBinary, "ROC Hold-out Logistic Regression")
#> Error in prediction(fitted, truth): could not find function "prediction"

The AUC for hold-out the test data using a logistic regression model is 0.9997.

8.2 Hold-out Results LDA

# non fits the model on the entire training set
trctrl.lda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
  
lda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "lda", trControl=trctrl.lda.selected, tuneLength = 10)
lda.test = predict(lda.cv.model.selected, newdata = haiti_ho_full, type="prob")
lda.preds = lda.test %>%
      mutate(.prediction=ifelse(Yes < .8, "0", "1")) %>%
      mutate(.prediction = factor(.prediction))

caret::confusionMatrix(lda.preds$.prediction, haiti_ho_full$ClassBinary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction       0       1
#>          0 1986970    2081
#>          1    2727   12399
#>                                           
#>                Accuracy : 0.9976          
#>                  95% CI : (0.9975, 0.9977)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.8364          
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.9986          
#>             Specificity : 0.8563          
#>          Pos Pred Value : 0.9990          
#>          Neg Pred Value : 0.8197          
#>              Prevalence : 0.9928          
#>          Detection Rate : 0.9914          
#>    Detection Prevalence : 0.9925          
#>       Balanced Accuracy : 0.9275          
#>                                           
#>        'Positive' Class : 0               
#> 

For LDA, the hold-out test results with the tuned threshold of 0.8 are:

  • Accuracy: 0.9976
  • TPR: 12399 / (14480) = 0.856
  • FPR: 2727 / (1989697) = 0.001
  • Precision: 14391 / (38033) = 0.8197
function_ROC_AUC(lda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out LDA")

#> [1] 0.9966824

The AUC for hold-out the test data using an LDA is 0.997.

8.3 Hold-out Results QDA

# fit the model on the entire training set
trctrl.qda.selected = trainControl(method = "none", summaryFunction=twoClassSummary, classProbs=T)
  
qda.cv.model.selected = train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "qda", trControl=trctrl.qda.selected, tuneLength = 10)
# test with holdout data
qda.test = predict(qda.cv.model.selected, newdata = haiti_ho_full, type="prob")

#head(qda.test)
qda.preds = qda.test %>%
      mutate(.prediction=ifelse(Yes < .8, "0", "1")) %>%
      mutate(.prediction = factor(.prediction))

caret::confusionMatrix(qda.preds$.prediction, haiti_ho_full$ClassBinary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction       0       1
#>          0 1979609    8727
#>          1   10088    5753
#>                                           
#>                Accuracy : 0.9906          
#>                  95% CI : (0.9905, 0.9907)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.3748          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.9949          
#>             Specificity : 0.3973          
#>          Pos Pred Value : 0.9956          
#>          Neg Pred Value : 0.3632          
#>              Prevalence : 0.9928          
#>          Detection Rate : 0.9877          
#>    Detection Prevalence : 0.9921          
#>       Balanced Accuracy : 0.6961          
#>                                           
#>        'Positive' Class : 0               
#> 

For QDA, the hold-out test results with the tuned threshold of 0.8 are:

  • Accuracy: 0.9906
  • TPR: 5753 / (14480) = 0.397
  • FPR: 10088 / (1989697) = 0.005
  • Precision: 14391 / (38033) = 0.3632
function_ROC_AUC(qda.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out QDA")

#> [1] 0.6348543

The AUC for hold-out the test data using an QDA and a threshold of 0.8 is 0.635.

8.4 Hold-out Results KNN

# fit the model on the entire training set
train_knn <- train(ClassBinaryYesNo ~ Blue+Green+Red+GBSqr+RBSqr, data = haitiAllTraining, method = "knn", tuneGrid = data.frame(k=27))

# test with the hold out data
knn.test = predict(train_knn, newdata = haiti_ho_full, type="prob")
knn.preds = knn.test %>%
      mutate(.prediction=ifelse(Yes < .5, "0", "1")) %>%
      mutate(.prediction = factor(.prediction))

caret::confusionMatrix(knn.preds$.prediction, haiti_ho_full$ClassBinary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction       0       1
#>          0 1975687    1858
#>          1   14010   12622
#>                                          
#>                Accuracy : 0.9921         
#>                  95% CI : (0.992, 0.9922)
#>     No Information Rate : 0.9928         
#>     P-Value [Acc > NIR] : 1              
#>                                          
#>                   Kappa : 0.6104         
#>                                          
#>  Mcnemar's Test P-Value : <2e-16         
#>                                          
#>             Sensitivity : 0.9930         
#>             Specificity : 0.8717         
#>          Pos Pred Value : 0.9991         
#>          Neg Pred Value : 0.4739         
#>              Prevalence : 0.9928         
#>          Detection Rate : 0.9858         
#>    Detection Prevalence : 0.9867         
#>       Balanced Accuracy : 0.9323         
#>                                          
#>        'Positive' Class : 0              
#> 

For QDA, the hold-out test results with the tuned threshold of 0.8 are:

  • Accuracy: 0.9921
  • TPR: 12622 / (14480) = 0.872
  • FPR: 14010 / (1989697) = 0.007
  • Precision: 12622 / (26632) = 0.4739
function_ROC_AUC(knn.test$Yes, haiti_ho_full$ClassBinary, "ROC Hold-out KNN")

#> [1] 0.9541119

The AUC for hold-out the test data using an KNN | k = 27 is 0.954.

8.5 Hold-out Results Lasso

I already have the matrix for the training data to retrain the model with the complete data set. However, I do need to create the testing matrix.

I tried using the lambda from the tuning, and it did not produce convergence on the test data.

x.haitiHoldout.test = model.matrix(frmla, data = haiti_ho_full)[,-1] # removing the intercept term from the formula

# switch ClassBinary to TRUE FALSE to facilitate the next bit of code
y.haitiHoldout.test = (haiti_ho_full %>% mutate(ClassBinary = ifelse(ClassBinary == '0', FALSE, TRUE)))$ClassBinary

#-- Get lambda sequence so consistent over all folds
lam_seq = glmnet(x.haitiFull.train, y.haitiFull.train, family="binomial", alpha=a)$lambda

Training on entire training data set, and predict with the hold out data. Evaluation of the testing data is performed with the tuned lambda and the best threshold based on accuracy.

#- fit the lasso with all the training data and teh best lambda
lasso_holdout_model = glmnet(x.haitiFull.train, y.haitiFull.train, alpha = 1,          
                           family="binomial")   

# make test predictions
lasso.test = predict(lasso_holdout_model, x.haitiHoldout.test, s = 5.630136e-05, type = "response")

lasso_out_confusion = tibble()

#- evaluate fold k
y_true = y.haitiHoldout.test
# set the threshold to p
thres = 0.2
y_hat = lasso.test > thres

  #- output
lasso_out_confusion = bind_rows(lasso_out_confusion, 
          tibble(truth = y.haitiHoldout.test, 
                 fitted = y_hat))

lasso_out_confusion = lasso_out_confusion %>% mutate(truth = factor(truth), fitted = factor(fitted))

caret::confusionMatrix(lasso_out_confusion$fitted, lasso_out_confusion$truth)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   FALSE    TRUE
#>      FALSE 1974381     100
#>      TRUE    15316   14380
#>                                           
#>                Accuracy : 0.9923          
#>                  95% CI : (0.9922, 0.9924)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.6476          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.9923          
#>             Specificity : 0.9931          
#>          Pos Pred Value : 0.9999          
#>          Neg Pred Value : 0.4842          
#>              Prevalence : 0.9928          
#>          Detection Rate : 0.9851          
#>    Detection Prevalence : 0.9852          
#>       Balanced Accuracy : 0.9927          
#>                                           
#>        'Positive' Class : FALSE           
#> 

For QDA, the hold-out test results with the tuned threshold of 0.2 are:

  • Accuracy: 0.9923
  • TPR: 14380 / (14480) = 0.993
  • FPR: 15316 / (1989697) = 0.008
  • Precision: 14380 / (29696) = 0.484
function_ROC_AUC(lasso.test, haiti_ho_full$ClassBinary, "ROC Hold-out Lasso")

#> [1] 0.9997257

The AUC for hold-out the test data using an lasso with tuned lambda and threshold of 0.2 is 0.9997.

haitiAllTraining = haitiAllTraining %>% mutate(ClassBinary = factor(ClassBinary))
head(haitiAllTraining)
#> # A tibble: 6 x 13
#>   Class   Red Green  Blue GBSqr RBSqr ClassBinary RGSqr RGBSqr    RG    RB    GB
#>   <chr> <dbl> <dbl> <dbl> <I<d> <I<d> <fct>       <I<d> <I<db> <dbl> <dbl> <dbl>
#> 1 Vege~    64    67    50  13.7  13.0 0           17161  32761  4288  3200  3350
#> 2 Vege~    64    67    50  13.7  13.0 0           17161  32761  4288  3200  3350
#> 3 Vege~    64    66    49  13.2  12.8 0           16900  32041  4224  3136  3234
#> 4 Vege~    75    82    53  18.2  16.4 0           24649  44100  6150  3975  4346
#> 5 Vege~    74    82    54  18.5  16.4 0           24336  44100  6068  3996  4428
#> 6 Vege~    72    76    52  16.4  15.4 0           21904  40000  5472  3744  3952
#> # ... with 1 more variable: ClassBinaryYesNo <fct>

8.6 Hold-out Results Random Forest

#- fit the random forest with all the training data use the best tuning
# training
rf.haiti_holdout = randomForest(ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data = haitiAllTraining, mtry=4, ntree=1000)
# test
rf.preds_holdout = predict(rf.haiti_holdout, newdata = haiti_ho_full, type="prob")

y_true = haiti_ho_full %>% mutate(ClassBinary = (ClassBinary == "1"))
y_true_val = y_true$ClassBinary
head(y_true_val)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
#y_true = haiti_ho_full$ClassBinary
#unique(y_true)

# set the threshold to p
y_hat = rf.preds_holdout[,2] > 0.6
head(y_hat)
#>     1     2     3     4     5     6 
#> FALSE FALSE FALSE FALSE FALSE FALSE
out_yhat_ytrue = tibble()

#- output
out_yhat_ytrue = bind_rows(out_yhat_ytrue, 
        tibble(thres = j, true = y_true_val, hat = y_hat, X0 = rf.preds_holdout[,1], X1 = rf.preds_holdout[,2]))

out_yhat_ytrue = out_yhat_ytrue %>% mutate(true = factor(true), hat = factor(hat))
caret::confusionMatrix(out_yhat_ytrue$hat, out_yhat_ytrue$true)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   FALSE    TRUE
#>      FALSE 1982836    6219
#>      TRUE     6861    8261
#>                                           
#>                Accuracy : 0.9935          
#>                  95% CI : (0.9934, 0.9936)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.5549          
#>                                           
#>  Mcnemar's Test P-Value : 2.086e-08       
#>                                           
#>             Sensitivity : 0.9966          
#>             Specificity : 0.5705          
#>          Pos Pred Value : 0.9969          
#>          Neg Pred Value : 0.5463          
#>              Prevalence : 0.9928          
#>          Detection Rate : 0.9894          
#>    Detection Prevalence : 0.9925          
#>       Balanced Accuracy : 0.7835          
#>                                           
#>        'Positive' Class : FALSE           
#> 

For Random Forest using 1000 trees, and 4 predictors at each split, the hold-out test results with the tuned threshold of 0.6 are:

  • Accuracy: 0.9937
  • TPR: 8520 / (14480) = 0.588
  • FPR: 6755 / (1989697) = 0.003
  • Precision: 0.5578
#head(out_yhat_ytrue)

function_ROC_AUC(out_yhat_ytrue$X1, out_yhat_ytrue$true, "ROC Hold-out Random Forest")

#> [1] 0.976642

The AUC for the testing data on the tuned Random Forest model is 0.977.

8.7 Hold-out Results Support Vector Machine

svm_model_holdout = tune(svm, ClassBinary~Red+Blue+Green+GBSqr+RBSqr+RGSqr+RGBSqr+RG+RB+GB, data=haitiAllTraining, kernel="linear", ranges=list(cost=c(1)), probability=TRUE)
      
pred = predict(svm_model_holdout$best.model, newdata = haiti_ho_full, probability=TRUE)
      

y_hat = attr(pred, "probabilities")[,"1"] > 0.2


out_svm_holdout = tibble()

# use y_true_val from random forest
out_svm_holdout = bind_rows(out_svm_holdout, 
      tibble(pred_value = y_hat, truth_value = y_true$ClassBinary, noProb = attr(pred, "probabilities")[,"0"], yesProb = attr(pred, "probabilities")[,"1"]))

out_svm_holdout = out_svm_holdout %>% mutate(pred_value = factor(pred_value), truth_value = factor(truth_value))

#head(out_svm_holdout)
caret::confusionMatrix(out_svm_holdout$pred_value, out_svm_holdout$truth_value)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   FALSE    TRUE
#>      FALSE 1968406      98
#>      TRUE    21291   14382
#>                                           
#>                Accuracy : 0.9893          
#>                  95% CI : (0.9892, 0.9895)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.5691          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.9893          
#>             Specificity : 0.9932          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.4032          
#>              Prevalence : 0.9928          
#>          Detection Rate : 0.9822          
#>    Detection Prevalence : 0.9822          
#>       Balanced Accuracy : 0.9913          
#>                                           
#>        'Positive' Class : FALSE           
#> 

For SVM using a linear kernel, with a cost == 1, the hold-out test results with the tuned threshold of 0.2 are:

  • Accuracy: 0.9895
  • TPR: 14377 / (14480) = 0.993
  • FPR: 20946 / (1989697) = 0.01
  • Precision: 0.41
function_ROC_AUC(out_svm_holdout$yesProb, out_svm_holdout$truth_value, "ROC Hold-out SVM")

#> [1] 0.999491

9 Results (Test/Validate Data)

Model Tuning AUROC Threshold Accuracy TPR FPR Precision
Log Reg N/A 0.9998 0.1 0.9882 0.994 0.012 0.38
LDA N/A 0.997 0.3 0.9976 0.856 0.001 0.8197
QDA N/A 0.635 0.8 0.9906 0.397 0.005 0.3632
KNN k = 27 0.954 0.5 0.9921 0.872 0.007 0.4739
Penalized Log Reg lamda = 5.630136e-05 0.9997 0.2 0.9923 0.993 0.008 0.484
Random Forest ntree = 1000, mtry = 4 0.977 0.6 0.9937 0.588 0.003 0.5578
SVM linear kernel, cost == 1 0.999 0.2 0.9895 0.993 0.01 0.41

10 Final Conclusions

10.0.1 Conclusion #1

10.0.2 Conclusion #2

10.0.3 Conclusion #3

10.0.4 Conclusion #4

10.0.5 Conclusion #5

10.0.6 Conclusion #6